Data Storage Systems - System Design Guide
A comprehensive guide to data storage and processing systems from a system design interview perspective.
Table of Contents
- Core Concepts Overview
- Storage Systems
- Processing Engines
- System Design Considerations
- Modern Data Architecture Patterns
- Interview Checklist
- Common Interview Scenarios
1. Core Concepts Overview
ETL (Extract → Transform → Load)
Definition: Data pipeline pattern where data is extracted from sources, transformed (cleaned, aggregated, enriched), and then loaded into the target system.
Purpose:
- Complex business logic transformations before storage
- Data quality enforcement at ingestion
- Schema validation and normalization
- Data enrichment and integration from multiple sources
When to Use:
- Transformations are computationally intensive
- Need to reduce data volume before loading
- Target system has limited compute capacity
- Compliance requires data cleansing before storage
- Legacy systems with limited query capabilities
Typical Tools: Talend, Informatica, AWS Glue, Apache Airflow + Spark, Azure Data Factory
Example Flow:
MySQL DB → Extract → Spark Transformation → Load → Snowflake
Logs (S3) → Extract → Filter/Aggregate → Load → Redshift
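As a concrete illustration of the transform-before-load step, here is a minimal PySpark sketch of the first flow above. The JDBC connection details, table names, and Snowflake Spark connector options are illustrative assumptions, not a prescribed setup.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-orders").getOrCreate()

# Extract: read the source table over JDBC (connection details are illustrative)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://mysql-host:3306/shop")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Transform: cleanse and aggregate before anything is persisted downstream
daily_revenue = (orders
                 .filter(F.col("status") == "COMPLETED")
                 .withColumn("order_date", F.to_date("created_at"))
                 .groupBy("order_date")
                 .agg(F.sum("amount").alias("revenue"),
                      F.countDistinct("customer_id").alias("customers")))

# Load: write the already-transformed result to the warehouse
# (exact format/options depend on the Snowflake Spark connector in use)
(daily_revenue.write
 .format("snowflake")
 .options(sfURL="account.snowflakecomputing.com", sfUser="etl_user",
          sfPassword="***", sfDatabase="ANALYTICS", sfSchema="PUBLIC",
          sfWarehouse="ETL_WH", dbtable="DAILY_REVENUE")
 .mode("overwrite")
 .save())
```

Only the aggregated result ever reaches the warehouse, which is exactly the storage/flexibility trade-off listed below.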
Trade-offs:
- ✅ Reduced storage costs (only transformed data stored)
- ✅ Better performance for downstream queries
- ✅ Data quality guaranteed before loading
- ❌ Longer pipeline latency
- ❌ Less flexibility for reprocessing
- ❌ Schema changes require pipeline updates
ELT (Extract → Load → Transform)
Definition: Modern pattern where raw data is loaded first, then transformations run inside the data warehouse/lake.
Purpose:
- Leverage powerful warehouse compute engines
- Preserve raw data for flexibility
- Enable faster ingestion
- Support schema evolution
When to Use:
- Using modern cloud warehouses (Snowflake, BigQuery)
- Need to preserve raw data
- Transformations might change frequently
- Multiple teams need different views of same data
- Real-time or near-real-time requirements
Advantages:
- Schema-on-read flexibility
- Raw data always available for reprocessing
- Faster initial ingestion
- Easier debugging (raw data visible)
- Better support for iterative analytics
Example Flow:
API Data → Load to S3 → Load to Snowflake → Transform with dbt
Kafka Stream → Load to BigQuery → Transform with SQL
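For comparison, a minimal Python sketch of the first flow above, assuming boto3 and the Snowflake Python connector are installed; the bucket, stage, credentials, and the `raw` VARIANT column on the staging table are illustrative assumptions.

```python
import boto3
import snowflake.connector

# Load: land raw API payloads in object storage first, untransformed
s3 = boto3.client("s3")
s3.upload_file("events_2024-01-15.json", "raw-data-bucket", "events/2024-01-15.json")

# Transform: run SQL inside the warehouse after the raw data is loaded
# (assumes an external stage RAW_STAGE pointing at the bucket already exists,
#  and that raw_events has a single VARIANT column named "raw")
conn = snowflake.connector.connect(
    account="my_account", user="elt_user", password="***",
    warehouse="TRANSFORM_WH", database="ANALYTICS", schema="PUBLIC")
cur = conn.cursor()
cur.execute("COPY INTO raw_events FROM @RAW_STAGE/events/ FILE_FORMAT = (TYPE = JSON)")
cur.execute("""
    CREATE OR REPLACE TABLE daily_events AS
    SELECT raw:user_id::string        AS user_id,
           DATE(raw:ts::timestamp)    AS event_date,
           COUNT(*)                   AS events
    FROM raw_events
    GROUP BY 1, 2
""")
conn.close()
```

The raw table stays available, so the SQL transform can be rewritten and rerun without re-ingesting anything.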
Trade-offs:
- ✅ Raw data preservation
- ✅ Faster time-to-insights
- ✅ More flexible reprocessing
- ❌ Higher storage costs
- ❌ Requires powerful warehouse compute
- ❌ Data quality issues propagate downstream
ETL vs ELT Comparison
| Aspect | ETL | ELT |
|---|---|---|
| Transform Location | Before load (external) | After load (in-warehouse) |
| Compute | Separate processing cluster | Warehouse compute |
| Data Volume | Reduced before storage | Full raw data stored |
| Schema | Fixed upfront | Flexible, schema-on-read |
| Latency | Higher (transform first) | Lower (load raw quickly) |
| Cost | Lower storage, higher compute | Higher storage, leverage warehouse |
| Use Case | Legacy systems, complex pre-processing | Modern cloud warehouses |
| Best For | Teradata, Oracle | Snowflake, BigQuery, Databricks |
2. Storage Systems
Data Warehouse
Definition: Structured, schema-on-write system optimized for analytics (OLAP) workloads.
Purpose:
- Centralized repository for structured business data
- Optimized for complex analytical queries
- Historical data analysis
- Business intelligence and reporting
Architecture Characteristics:
- Columnar storage: Optimized for aggregate queries
- Append-only: Immutable historical records
- Optimized joins: Efficient cross-table analytics
- Materialized views: Pre-computed aggregations
- Query optimization: Cost-based optimizers
Data Modeling:
- Star Schema: Fact table + dimension tables
- Snowflake Schema: Normalized dimension tables
- Fact Tables: Metrics, measurements (sales, clicks)
- Dimension Tables: Descriptive attributes (users, products, time)
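A hedged PySpark sketch of a typical star-schema query: join the fact table to its dimensions, then aggregate. Table and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("star-schema-query").getOrCreate()

# Illustrative star schema: one fact table plus two dimension tables
fact_sales  = spark.table("fact_sales")    # grain: one row per order line
dim_product = spark.table("dim_product")   # descriptive product attributes
dim_date    = spark.table("dim_date")      # calendar attributes

# Typical OLAP query: join the fact to its dimensions, then aggregate
revenue_by_category = (fact_sales
    .join(dim_product, "product_key")
    .join(dim_date, "date_key")
    .groupBy("category", "year", "month")
    .agg(F.sum("sales_amount").alias("revenue")))

revenue_by_category.show()
```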
Examples:
- Snowflake (cloud-native)
- Amazon Redshift (AWS)
- Google BigQuery (GCP)
- Azure Synapse Analytics
- Teradata (on-premise/legacy)
When to Use:
- Structured business analytics
- BI dashboards and reports
- SQL-heavy workloads
- Historical trend analysis
- Regulatory reporting
Trade-offs:
- ✅ Excellent query performance for analytics
- ✅ SQL interface familiar to analysts
- ✅ Strong consistency guarantees
- ✅ Optimized for aggregations and joins
- ❌ Schema changes are costly
- ❌ Not ideal for unstructured data
- ❌ Higher cost per GB than data lakes
- ❌ Not designed for real-time updates
Data Lake
Definition: Repository that stores raw data in native format - structured, semi-structured, and unstructured.
Purpose:
- Store massive volumes of raw data cheaply
- Support diverse data types and formats
- Enable exploratory analytics and ML
- Serve as single source of truth
Schema Approach:
- Schema-on-read: Structure applied at query time
- Format flexibility: JSON, CSV, Parquet, Avro, logs, images, videos
- No upfront modeling: Store first, structure later
Examples:
- AWS S3 + Athena/Glue
- Azure Data Lake Storage (ADLS)
- Google Cloud Storage + BigQuery
- Hadoop HDFS (on-premise)
Common Formats:
- Parquet: Columnar, compressed, great for analytics
- Avro: Row-based, schema evolution support
- ORC: Optimized columnar format for Hive
- JSON: Flexible but inefficient for large scale
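A short PySpark sketch of the common "raw JSON in, partitioned Parquet out" conversion, assuming the job runs with S3 access configured; paths and column names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read raw JSON events from the lake's landing zone (path is a placeholder)
events = spark.read.json("s3://my-lake/raw/events/")

# Derive partition columns so downstream queries can prune by date
events = (events
          .withColumn("event_date", F.to_date("ts"))
          .withColumn("year",  F.year("event_date"))
          .withColumn("month", F.month("event_date"))
          .withColumn("day",   F.dayofmonth("event_date")))

# Write columnar, compressed Parquet partitioned by date
(events.write
 .mode("append")
 .partitionBy("year", "month", "day")
 .option("compression", "snappy")
 .parquet("s3://my-lake/curated/events/"))
```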
When to Use:
- Storing raw logs, events, clickstreams
- IoT sensor data
- Machine learning training data
- Unstructured data (images, videos, documents)
- Data science exploration
- Cost-effective long-term storage
Challenges:
- Data governance and cataloging
- Data quality consistency
- Query performance can be poor
- Difficult to maintain ACID properties
- "Data swamp" risk without proper management
Trade-offs:
- ✅ Very low storage cost
- ✅ Handles any data type
- ✅ Massive scalability
- ✅ Great for ML and data science
- ❌ No ACID guarantees (traditionally)
- ❌ Poor query performance without optimization
- ❌ Governance challenges
- ❌ Requires data catalog tools
Lakehouse Architecture
Definition: Hybrid approach combining data lake flexibility with data warehouse performance and reliability.
Purpose:
- Single platform for all data workloads
- ACID transactions on data lake storage
- Schema enforcement with flexibility
- Support both BI and ML workloads
Key Technologies:
- Delta Lake (Databricks)
- Apache Iceberg (Netflix/Apple)
- Apache Hudi (Uber)
Features:
- ACID transactions on object storage
- Time travel and versioning
- Schema enforcement and evolution
- Unified batch and streaming
- Z-ordering and data skipping
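A minimal sketch of two of these features (ACID MERGE and time travel) using the Delta Lake Python API, assuming a Spark session configured with the delta-spark package; paths and keys are illustrative.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

path = "s3://my-lake/silver/users"          # placeholder table location
updates = spark.read.parquet("s3://my-lake/bronze/user_updates/")

# ACID upsert (MERGE) directly on object storage
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at an earlier version
users_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```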
Medallion Architecture Layers: Bronze (raw ingested data) → Silver (cleaned, validated data) → Gold (aggregated, business-ready data); see the Medallion Architecture pattern in Section 5 for details.
When to Use:
- Need both data lake economics and warehouse performance
- Supporting diverse teams (data scientists + analysts)
- Want to eliminate data silos
- Need strong consistency with S3-like costs
Trade-offs:
- ✅ Best of both worlds (lake + warehouse)
- ✅ Lower cost than pure warehouse
- ✅ ACID guarantees on cheap storage
- ✅ Eliminates data duplication
- ❌ More complex to set up
- ❌ Newer technology (less mature)
- ❌ Requires learning new concepts
3. Processing Engines
Apache Spark
Definition: Distributed compute engine for large-scale data processing (batch and streaming).
Purpose:
- Process massive datasets in parallel
- Complex transformations and aggregations
- Machine learning pipelines
- Unified batch and streaming
Architecture Components:
- Driver: Coordinates work, builds execution plan
- Executors: Perform computations on worker nodes
- Cluster Manager: Resource allocation (YARN, Kubernetes, Mesos)
APIs:
- Spark SQL (DataFrames) - Structured data processing
- RDD (Resilient Distributed Datasets) - Low-level API
- Structured Streaming - Stream processing
- MLlib - Machine learning library
- GraphX - Graph processing
Processing Model:
- The driver builds a DAG of transformations; execution is lazy until an action (write, collect, count) triggers it
- The DAG is split into stages and tasks that executors run in parallel over data partitions
- Intermediate results can be cached in memory to speed up iterative workloads
When to Use:
- ETL pipelines with complex logic
- Large-scale data transformations (TB/PB scale)
- Machine learning at scale
- Batch processing of historical data
- Stream processing with Structured Streaming (second-level latency acceptable)
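To illustrate the last point, a minimal Structured Streaming sketch that aggregates clicks from Kafka in one-minute event-time windows. It assumes the Spark Kafka integration package is on the classpath; broker, topic, and checkpoint paths are placeholders, and the console sink stands in for a real one.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream-agg").getOrCreate()

# Read a Kafka topic as an unbounded DataFrame (broker/topic names are placeholders)
clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .selectExpr("CAST(value AS STRING) AS json", "timestamp"))

# Micro-batch aggregation: clicks per 1-minute window, with a lateness watermark
counts = (clicks
          .withWatermark("timestamp", "5 minutes")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

# Write incremental results; the console sink is for illustration only
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/clicks")
         .start())
query.awaitTermination()
```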
Common Deployment Pattern:
- Spark on YARN, Kubernetes, or a managed service (AWS Glue, EMR, Databricks)
- Reads raw data from object storage (S3/ADLS) and writes curated output to a warehouse or lakehouse tables
- Jobs scheduled and monitored by an orchestrator such as Airflow
Trade-offs:
- ✅ Massive scalability (handles PB-scale data)
- ✅ Rich transformation APIs (SQL, DataFrame, RDD)
- ✅ Supports multiple languages (Scala, Python, Java, R)
- ✅ Unified batch and streaming
- ✅ In-memory processing for speed
- ❌ High operational complexity
- ❌ Expensive for small workloads
- ❌ Requires cluster management expertise
- ❌ Higher latency than true streaming (Flink)
Snowflake
Definition: Cloud-native data warehouse with separated compute and storage architecture.
Architecture Innovation:
- Storage Layer: Centralized, automatic replication, micro-partitions
- Compute Layer: Multiple independent virtual warehouses
- Cloud Services: Metadata, optimization, security, transaction management
Key Features:
- Separation of Compute and Storage
- Zero-Copy Cloning: Instant data copies without storage duplication
- Time Travel: Query historical data (up to 90 days)
- Multi-cluster Warehouses: Auto-scaling for high concurrency
- Data Sharing: Share live data between accounts without copying
- Secure Views: Row/column-level security
When to Use:
- Business intelligence and analytics
- Data warehousing with variable workloads
- Need to scale read and write independently
- Multiple teams with different compute needs
- Cross-organization data sharing
- Ad-hoc analytics with varying concurrency
Design Patterns:
- Create separate warehouses for ETL, BI, Data Science
- Use clustering keys for large tables (>1TB)
- Materialize views for repeated aggregations
- Partition by time for time-series data
- Use multi-cluster warehouses for high concurrency
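A small sketch of the first pattern (workload-specific warehouses) using the Snowflake Python connector; account, credentials, and table names are illustrative.

```python
import snowflake.connector

# Connect with a workload-specific virtual warehouse (names are illustrative)
conn = snowflake.connector.connect(
    account="my_account", user="bi_user", password="***",
    warehouse="BI_WH", database="ANALYTICS", schema="REPORTING")
cur = conn.cursor()

# Point BI queries at their own warehouse so ETL load never competes for compute
cur.execute("USE WAREHOUSE BI_WH")
cur.execute("""
    SELECT region, DATE_TRUNC('day', sold_at) AS day, SUM(amount) AS revenue
    FROM fact_sales
    GROUP BY 1, 2
    ORDER BY day DESC
    LIMIT 100
""")
for region, day, revenue in cur.fetchall():
    print(region, day, revenue)
conn.close()
```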
Trade-offs:
- ✅ Scales compute independently from storage
- ✅ No infrastructure management (fully managed)
- ✅ Excellent concurrency handling
- ✅ Pay only for compute used (per-second billing)
- ✅ Automatic optimization and tuning
- ❌ Can be expensive at scale (compute costs add up)
- ❌ Vendor lock-in
- ❌ Not ideal for real-time updates (optimized for batch/micro-batch)
- ❌ Limited support for unstructured data
4. System Design Considerations
Data Flow Architecture
Typical Modern Data Stack:
Sources (OLTP databases, APIs, event streams) → Ingestion (Kafka/Kinesis, CDC with Debezium) → Storage (data lake on S3/ADLS and/or warehouse such as Snowflake/BigQuery) → Transformation (Spark, dbt) → Consumption (BI tools, ML pipelines, APIs), orchestrated by a tool like Airflow
Key Design Questions
When designing a data analytics system, always ask these questions:
1. Data Characteristics
- Volume: How much data per day/hour/minute?
- Velocity: Real-time, near real-time, micro-batch, or batch?
- Variety: Structured, semi-structured, or unstructured?
- Veracity: What are data quality and consistency needs?
2. Access Patterns
- Read-heavy or write-heavy?
- Query latency requirements? (milliseconds vs seconds vs minutes)
- Number of concurrent users?
- Ad-hoc vs predefined queries?
- Point queries vs analytical aggregations?
3. Processing Requirements
- Transformation complexity? (simple filters vs complex joins)
- Need for joins across multiple sources?
- Stateful vs stateless processing?
- Reprocessing/backfill requirements?
- Data lineage and audit needs?
4. Scalability Needs
- Expected data growth rate?
- Geographic distribution requirements?
- Peak vs average load patterns?
- Data retention and archival policies?
Storage Selection Matrix
| Scenario | Best Choice | Reason |
|---|---|---|
| Raw logs & ML data | Data Lake (S3/ADLS) | Schema-on-read, cost-effective, supports any format |
| Business reports & BI | Data Warehouse (Snowflake/BigQuery) | Optimized for OLAP, SQL interface |
| Complex ETL pipelines | Spark / AWS Glue | Distributed compute, handles TB/PB scale |
| Unified raw + analytics | Lakehouse (Delta/Iceberg) | Combines lake economics with warehouse performance |
| Real-time analytics | Kafka + Flink + Warehouse | Low latency streaming pipeline |
| High concurrency BI | Snowflake / BigQuery | Scales compute separately, multi-cluster support |
| Operational analytics | Druid / ClickHouse | Fast aggregations, sub-second queries |
| Time-series data | InfluxDB / TimescaleDB / Druid | Optimized for time-based queries |
| Search & text analytics | Elasticsearch | Full-text search, log analytics |
| Graph relationships | Neo4j / Amazon Neptune | Graph traversal, relationship queries |
Scaling Strategies
Ingestion Scaling
Strategies:
- Message queues (Kafka, Kinesis) for buffering
- Partition by key for parallel processing
- Rate limiting at source
- Back-pressure mechanisms
- Dead letter queues for failed messages
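A brief sketch of key-based partitioning at the producer, using the kafka-python client; broker, topic, and the event shape are illustrative.

```python
import json
from kafka import KafkaProducer

# Producer that partitions by key, so all events for a given user land in the
# same partition and can be processed in order by one consumer in the group
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

event = {"user_id": "u-123", "action": "page_view", "path": "/pricing"}
producer.send("clickstream", key=event["user_id"], value=event)
producer.flush()
```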
Storage Scaling
Strategies:
- Separate compute from storage (Snowflake model)
- Horizontal partitioning (sharding by key)
- Time-based partitioning (year/month/day)
- Tiered storage (hot/warm/cold based on access patterns)
- Compression and columnar formats
Query Performance Optimization
Techniques:
- Materialized views for repeated aggregations
- Clustering keys / Sort keys for common filters
- Data skipping with statistics (min/max, bloom filters)
- Result caching for identical queries
- Query federation across multiple sources
- Partition pruning to scan only relevant data
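A quick PySpark illustration of partition pruning against a date-partitioned Parquet table (like the one written in the Data Lake section); the path is a placeholder.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

# Table written with partitionBy("year", "month", "day"); path is a placeholder
events = spark.read.parquet("s3://my-lake/curated/events/")

# Filtering on partition columns lets the engine skip every other partition
jan_15 = events.filter((F.col("year") == 2024) &
                       (F.col("month") == 1) &
                       (F.col("day") == 15))

# The physical plan should show partition filters rather than a full scan
jan_15.explain()
```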
Concurrency Handling
Strategies:
- Multi-cluster compute (Snowflake multi-cluster warehouses)
- Read replicas for read-heavy workloads
- Connection pooling to manage connections efficiently
- Query queuing and prioritization
- Workload management (separate ETL from BI queries)
5. Modern Data Architecture Patterns
Lambda Architecture
Concept: Separate batch and speed layers for comprehensive data processing.
Layers:
- Batch Layer: Historical data processed with Spark/MapReduce on HDFS/S3
  - Complete and accurate
  - High latency (hours/days)
  - Immutable dataset
- Speed Layer: Real-time data processed with Flink/Storm from Kafka
  - Low latency (seconds/minutes)
  - Approximate results
  - Compensates for batch layer lag
- Serving Layer: Merged views from both layers (Druid, Cassandra, HBase)
  - Queries combine batch and real-time data
  - Application-facing API
Purpose: Handle both batch and real-time processing with different latency/accuracy trade-offs
Use Cases:
- Real-time dashboards with historical context
- Fraud detection with learning from history
- Recommendation systems
Challenges:
- Complexity of maintaining two separate code paths
- Data consistency between layers
- Operational overhead
Kappa Architecture
Concept: Simplified approach treating everything as a stream.
Simplified Approach:
- Single stream processing pipeline for all data
- Replayable message queue (Kafka with long retention)
- Same code handles both real-time and batch
Components:
- Stream Storage: Kafka with days/weeks of retention
- Processing: Flink, Spark Streaming, Kafka Streams
- Serving: Cassandra, Elasticsearch, Druid
Advantages:
- Single codebase (no duplicate logic)
- Easier to maintain and debug
- Flexibility to reprocess by replaying stream
When to Use:
- All data can be modeled as events/streams
- Team has strong stream processing expertise
- Want to avoid complexity of Lambda
Limitations:
- Requires replayable message queue
- Historical reprocessing can be slow
- May not suit all use cases
Medallion Architecture (Lakehouse)
Concept: Progressive data refinement through layers (Bronze → Silver → Gold).
Layer Characteristics:
| Layer | Data Quality | Schema | Purpose | Example |
|---|---|---|---|---|
| Bronze | Raw, as-is | Flexible | Ingestion, lineage | Raw JSON logs, CDC events |
| Silver | Cleaned, validated | Enforced | Standardized business data | Deduplicated users, validated transactions |
| Gold | Aggregated, enriched | Optimized | Analytics, ML features | Daily sales by region, user behavior features |
Benefits:
- Clear data lineage and quality progression
- Different SLAs per layer (Bronze: append-only, Silver: updates allowed, Gold: optimized)
- Supports both exploratory and production workloads
- Incremental quality improvement
- Easy debugging (can trace back through layers)
Implementation:
Bronze: Raw Kafka events → Delta table (append-only)
Silver: Deduplicate, validate, join dimensions → Delta table (upserts)
Gold: Aggregate by time/dimension, create features → Delta table (optimized)
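A condensed PySpark/Delta sketch of the three layers above, assuming delta-spark is configured; paths and column names are illustrative, and a production Silver step would usually be an incremental MERGE rather than a full overwrite.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: append raw events exactly as ingested (schema-on-read, full lineage)
raw = spark.read.json("s3://my-lake/landing/events/")
raw.write.format("delta").mode("append").save("s3://my-lake/bronze/events")

# Silver: deduplicate and validate into a standardized table
bronze = spark.read.format("delta").load("s3://my-lake/bronze/events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .filter(F.col("user_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("s3://my-lake/silver/events")

# Gold: business-level aggregates optimized for analytics
gold = (silver
        .groupBy(F.to_date("ts").alias("event_date"), "region")
        .agg(F.count("*").alias("events"),
             F.countDistinct("user_id").alias("active_users")))
gold.write.format("delta").mode("overwrite").save("s3://my-lake/gold/daily_activity")
```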
6. Interview Checklist
When designing a data analytics platform in an interview:
Step 1: Clarify Requirements (5-10 minutes)
Questions to Ask:
- What are the primary data sources? (databases, APIs, logs, IoT)
- What's the data volume? (MB/GB/TB per day)
- What's the data velocity? (real-time, batch, micro-batch)
- What are the query latency requirements? (milliseconds, seconds, minutes)
- How many concurrent users? (10s, 100s, 1000s)
- What are the retention needs? (days, months, years)
- Are there compliance/regulatory requirements? (GDPR, HIPAA)
- What's the budget? (cost constraints)
Step 2: Design Ingestion (5-10 minutes)
Decisions to Make:
- Batch vs streaming ingestion?
- Need for message queue? (Kafka/Kinesis)
- CDC for databases? (Debezium)
- Schema validation strategy?
- Error handling and dead letter queues?
- Data partitioning strategy?
Step 3: Choose Processing Approach (5 minutes)
Decisions to Make:
- ETL vs ELT approach?
- Processing engine? (Spark, Flink, dbt)
- Transformation complexity?
- Orchestration tool? (Airflow, Dagster, Prefect)
- Incremental vs full refresh?
Step 4: Select Storage (10 minutes)
Decisions to Make:
- Data lake for raw data?
- Warehouse for analytics?
- Lakehouse for unified approach?
- Partitioning strategy? (time-based, key-based)
- Data retention and archival?
- File format? (Parquet, Avro, ORC)
Step 5: Plan Consumption (5 minutes)
Decisions to Make:
- BI tool integration? (Tableau, Looker, Power BI)
- API layer needed? (REST, GraphQL)
- ML model serving? (Feature store)
- Data access controls? (RBAC, row-level security)
- Caching strategy?
Step 6: Address Non-Functional Requirements (5-10 minutes)
Considerations:
- Scalability strategy? (horizontal, vertical)
- Monitoring and alerting? (data quality, pipeline health)
- Data quality checks? (validation, anomaly detection)
- Disaster recovery plan? (backups, replication)
- Security and compliance? (encryption, access control)
- Cost optimization? (tiered storage, auto-scaling)
7. Common Interview Scenarios
Scenario 1: Real-time Analytics Dashboard
Problem Statement: "Design a system to display user activity metrics on a website in near real-time. The dashboard should show active users, page views, clicks, and conversions updated every few seconds."
Requirements Clarification:
- Traffic: 100K requests/second
- Latency: Display updates within 5-10 seconds
- Metrics: Active users (1-min window), page views, clicks, conversions
- Retention: Hot data (7 days), historical data (1 year)
- Users: 100 concurrent dashboard viewers
Design: Website events → Kafka → Flink (windowed aggregations) → Redis (hot data) + Snowflake (historical) → API layer (WebSocket/SSE) → dashboard
Key Decisions:
- Ingestion Layer:
  - Kafka for high-throughput event ingestion (handles 100K events/sec easily)
  - Partitioned by user_id for parallel processing
- Stream Processing:
  - Flink for stateful stream processing with exactly-once guarantees
  - 5-second tumbling windows for aggregations
  - Maintains state for active user tracking (1-minute window)
- Storage Strategy:
  - Redis: Hot data (last 5 minutes) for dashboard queries
  - Snowflake: Historical data for trend analysis
  - Dual-write pattern: real-time to Redis, batch to Snowflake
- API Layer:
  - WebSocket/Server-Sent Events for pushing updates to dashboard
  - Queries Redis for real-time metrics
  - Queries Snowflake for historical context
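To make the Storage Strategy and API Layer decisions concrete, a small redis-py sketch of the hot path: the stream job writes each window's metrics with a short TTL, and the WebSocket API reads them back. Host, key names, and the metric payload are illustrative.

```python
import json
import redis

# Hot-path store for the dashboard (host/keys are illustrative)
r = redis.Redis(host="redis-host", port=6379, decode_responses=True)

def write_window_metrics(window_start: str, metrics: dict) -> None:
    """Called by the stream job for each window; entries expire after 5 minutes."""
    r.setex(f"metrics:{window_start}", 300, json.dumps(metrics))

def latest_metrics(window_start: str) -> dict:
    """Called by the WebSocket API to push the freshest numbers to the dashboard."""
    payload = r.get(f"metrics:{window_start}")
    return json.loads(payload) if payload else {}

write_window_metrics("2024-01-15T12:00:05", {"active_users": 4321, "page_views": 98765})
print(latest_metrics("2024-01-15T12:00:05"))
```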
Scalability:
- Kafka partitions: 10 partitions for parallelism
- Flink parallelism: 5 task managers
- Redis cluster: 3 nodes with replication
- Auto-scaling for API layer based on WebSocket connections
Scenario 2: Data Lake for Machine Learning
Problem Statement: "Design a data lake to store and process diverse data sources for training machine learning models. Data includes user behavior logs, transaction records, product catalog, and customer support tickets."
Requirements:
- Data sources: Logs (JSON), Databases (MySQL), Files (CSV), Text (tickets)
- Volume: 10 TB/day
- ML use case: Customer churn prediction, product recommendations
- Data scientists: 50 users running ad-hoc queries
- Training frequency: Daily batch jobs
Design: Logs, MySQL extracts, CSV files, and support tickets → ingestion into an S3 data lake (Bronze raw / Silver Parquet) → Glue Catalog + Athena for exploration → Spark feature engineering → Feature Store / SageMaker training, orchestrated by Airflow
Key Decisions:
- Storage Format:
  - Bronze (Raw): Original format (JSON, CSV) for lineage
  - Silver (Curated): Parquet format for efficient analytics
  - Partitioning: By date (`/year=2024/month=01/day=15/`)
- Schema Management:
  - Glue Crawler for automatic schema discovery
  - Glue Data Catalog as centralized metadata store
  - Schema evolution support with Parquet
- Feature Engineering:
  - Spark jobs in AWS Glue for complex transformations
  - Feature Store for ML features (training + serving)
  - Separate training and inference pipelines
- Access Patterns:
  - Data Scientists: Athena for SQL exploration, Jupyter notebooks
  - ML Training: SageMaker reading from S3/Feature Store
  - Orchestration: Airflow for scheduling ETL and training jobs
- Data Governance:
  - Data lineage tracking with Apache Atlas
  - S3 lifecycle policies for cost optimization (hot → cold storage)
  - IAM roles for fine-grained access control
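A hedged PySpark sketch of the Feature Engineering decision above: a daily job that reads Silver-layer Parquet and writes churn features partitioned by snapshot date. Paths, columns, and the feature definitions are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("churn-features").getOrCreate()

# Silver-layer inputs (paths and columns are illustrative)
tx    = spark.read.parquet("s3://ml-lake/silver/transactions/")
users = spark.read.parquet("s3://ml-lake/silver/users/")

# Simple churn features: 90-day spend, order count, and days since last order
recent = tx.filter(F.col("tx_date") >= F.date_sub(F.current_date(), 90))
features = (recent.groupBy("user_id")
            .agg(F.sum("amount").alias("spend_90d"),
                 F.count("*").alias("orders_90d"),
                 F.datediff(F.current_date(), F.max("tx_date")).alias("days_since_order"))
            .join(users.select("user_id", "signup_date", "plan"), "user_id"))

# Partition by snapshot date so daily training jobs read only what they need
(features.withColumn("snapshot_date", F.current_date())
 .write.mode("overwrite")
 .partitionBy("snapshot_date")
 .parquet("s3://ml-lake/gold/churn_features/"))
```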
Cost Optimization:
- S3 Intelligent-Tiering for automatic cost optimization
- Parquet compression (Snappy) reduces storage by 70%
- Athena charges per query (pay only for scanned data)
- Spot instances for Spark/SageMaker training jobs
Scenario 3: Enterprise Data Warehouse
Problem Statement: "Design a centralized data warehouse for a retail company. The warehouse should consolidate data from multiple OLTP databases, support 500 concurrent BI users, and enable complex analytical queries for sales reporting, inventory management, and customer analytics."
Requirements:
- Data sources: 5 MySQL databases (orders, inventory, customers, products, stores)
- Update frequency: Every 15 minutes
- Query complexity: Complex joins across 10+ tables, window functions
- Users: 500 concurrent BI users
- SLA: 95% of queries under 10 seconds
- Data retention: 5 years
Design: 5 MySQL databases → Debezium CDC → Kafka → 15-minute micro-batch loads into Snowflake (separate ETL and BI warehouses) → dbt models and tests → BI tools serving 500 concurrent users
Key Decisions:
- Data Ingestion Strategy:
  - CDC with Debezium: Captures changes from MySQL databases
  - Kafka: Buffers change events, enables replay
  - 15-minute micro-batches: Balances freshness with load
  - Separate ETL warehouse from BI warehouse for workload isolation
- Data Modeling:
  - Star schema for query performance (denormalized dimensions)
  - SCD Type 2 for customer dimension (track history)
  - Clustering keys: `date` for fact_sales, `store_id` for fact_inventory
  - Materialized views: Pre-compute common aggregations
- Snowflake Architecture:
  - ETL Warehouse: X-Large, scheduled for 15-min loads
  - BI Warehouse: Large with multi-cluster (1-10) for 500 users
  - Auto-scaling: Automatically adds clusters during peak usage
  - Auto-suspend: Warehouses suspend after 5 minutes of inactivity
- Performance Optimization:
  - Clustering keys on large tables (>100 GB)
  - Materialized views for daily/weekly aggregations
  - Result caching (24 hours)
  - Search optimization service for point lookups
- Semantic Layer (dbt):
  - Centralized business logic and metric definitions
  - Data quality tests (not null, unique, referential integrity)
  - Documentation and lineage
  - Incremental models for large fact tables
- High Availability:
  - Snowflake multi-AZ deployment
  - Kafka replication factor: 3
  - Spark cluster with multiple workers
  - BI tool load balancing
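To illustrate the data quality tests listed under the dbt semantic layer, a small PySpark sketch of the same checks (not null, unique, referential integrity) expressed directly against the warehouse tables; table and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

fact_sales   = spark.table("fact_sales")       # illustrative warehouse tables
dim_customer = spark.table("dim_customer")

# Not-null check: every fact row must carry a customer key
null_keys = fact_sales.filter(F.col("customer_key").isNull()).count()

# Uniqueness check: the dimension's surrogate key must not repeat
dup_keys = (dim_customer.groupBy("customer_key").count()
            .filter(F.col("count") > 1).count())

# Referential integrity: every fact customer_key must exist in the dimension
orphans = fact_sales.join(dim_customer, "customer_key", "left_anti").count()

assert null_keys == 0, f"{null_keys} fact rows with NULL customer_key"
assert dup_keys == 0,  f"{dup_keys} duplicate customer keys in dim_customer"
assert orphans == 0,   f"{orphans} fact rows reference missing customers"
```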
Query Performance:
- 95th percentile query time: 8 seconds
- Cache hit rate: 60% (repeated queries)
- Concurrent query support: 500+ users
- Auto-scaling responds to load in <1 minute